High-throughput sequencing and genomes

Jelmer Poelstra

CFAES Bioinformatics Core, Ohio State University

2026-01-29

Introduction to sequencing technologies

What do we mean by sequencing?

What does the term “sequencing”, as in “high-throughput sequencing”, generally refer to? Determining the nucleotide sequence of DNA fragments: the order of As, Cs, Gs, and Ts.


And what about RNA? RNA is usually reverse transcribed to DNA (cDNA) prior to sequencing, as in nearly all “RNA-Seq”.

Sequencing technologies: overview

  • Sanger sequencing
    Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time

  • High-throughput sequencing (HTS)
    Sequences 10^5 to 10^9 DNA fragments (“reads”) at a time, usually randomly selected. There are two types:

    • Short-read HTS: Cheaper, more accurate, but shorter reads
    • Long-read HTS: More expensive, less accurate, but longer reads

Sequencing technology development timeline


Modified after Pereira, Oliveira, and Sousa (2020)

Sanger sequencing

Sequences a single, typically PCR-amplified, short-ish (≤900 bp) DNA fragment at a time.

Sequencing is performed by synthesizing a new DNA strand with fluorescently-labeled nucleotides, using a different color for each base (A, C, G, T).


The final result is a chromatogram that can be “base-called”:

https://dnacore.mgh.harvard.edu/new-cgi-bin/site/pages/sequencing_pages/seq_troubleshooting.jsp


The entire human genome (3 Gbp) was sequenced with Sanger technology!

Anyone want to guess how much this may have cost?

Sequencing cost through time

https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost

Present-day Sanger applications

With HTS, DNA can be sequenced much more efficiently and cheaply, and Sanger sequencing has become less widely used.


But it is not obsolete, in part because high throughput isn’t always needed –
some present-day uses of Sanger sequencing:

  • Examining variation among individuals or populations in one or more candidate or marker genes (for population genetics, phylogenetics, functional inferences, etc.)

  • Taxonomic identification of samples

High-throughput sequencing (HTS)

Omics

Looking at the bigger picture first, HTS produces data that underlies several of these main “omics” approaches:

A diagram showing the main omics data types.

Copyright ThermoFisher

The main omics data types

Omics type        Molecule type
Genomics          DNA
Epigenomics       DNA modifications
Transcriptomics   RNA
Proteomics        Proteins
Metabolomics      Metabolites


What does the -omics suffix mean?

The “omics” suffix indicates the involvement of large-scale datasets — in the sense that, for example, “genomics” data typically spans much or all of the genome.

While the boundaries can be fuzzy, sequencing a single gene in a single organism is not genomics, and running qPCR for a handful of genes is not transcriptomics.

The main omics data types (cont.)

Omics type        Molecule type       Data mainly produced by
Genomics          DNA                 High-throughput sequencing (HTS)
Epigenomics       DNA modifications   High-throughput sequencing (HTS)
Transcriptomics   RNA                 High-throughput sequencing (HTS)
Proteomics        Proteins            Mass spectrometry
Metabolomics      Metabolites         Mass spectrometry

Examples of HTS applications

  • Whole-genome assembly
  • Analysis of SNPs and other sequence variants
    (For population genetics/genomics, GWAS, etc. – often referred to as “resequencing”)
  • RNA-Seq (transcriptome analysis)
  • Microbial community characterization
    • Metabarcoding
    • Shotgun metagenomics

Illustration of HTS

[ILLUSTRATION OF A TANGLED MASS OF READS - Recall that “reads” are sequenced fragments of DNA]

  • Assemble (“build back”) into a single sequence that can be used e.g. as a reference

  • Compare specific sequence variants across multiple samples

  • Count the number of reads originating from distinct units, such as genes (RNA-Seq) or organisms (microbial community characterization)

Two key variables in HTS

  • Read lengths
  • Error rates

Read lengths

HTS read lengths vary from 300 bp and shorter (short-read HTS) up to tens of thousands of base pairs (long-read HTS).


Can you think of applications where long reads are useful?

For example:

  • Genome assembly
  • Taxonomic identification of single reads (microbial metabarcoding)

Can you think of applications where read length may not matter much?

For example:

  • (SNP) variant analysis
  • Counting applications such as RNA-Seq, …. TBA

Error rates

Currently, no sequencing technology is error-free: the sequenced read may differ from the actual DNA sequence it came from.

  • The read can have base-calling errors, missing bases, or extra bases
  • When the base calling software is not confident, it can also return Ns (= undetermined)

A chromatogram with several uncalled bases.

When you receive HTS reads, base calls have typically been made already.
Every base call is accompanied by a quality score, representing the estimated error probability.
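To make the link between a quality score and an error probability concrete, here is a minimal sketch. It assumes the widely used Phred scale (Q = -10 · log10(p)) and the “Phred+33” ASCII encoding found in most current FASTQ files. The slides name neither, so treat both as assumptions, and the function names as purely illustrative.

```python
import math

# Assumption: qualities are stored as ASCII characters with an offset of 33
# ("Phred+33"), as in most current FASTQ files.
PHRED_OFFSET = 33


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p to a Phred quality score Q = -10*log10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str) -> list[int]:
    """Decode a quality string into one integer Phred score per base."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


# One quality character per base call in the read
for q in decode_quality_string("I5+!"):
    print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4g}")
```

On this scale, every 10-point increase in Q corresponds to a 10-fold lower estimated error probability: Q20 means 1 in 100, and Q30 means 1 in 1,000.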

Correcting sequencing errors

To overcome sequencing errors, every base can be sequenced multiple times –
i.e., obtaining a “depth of coverage” greater than 1:

A diagram illustrating the concept of depth of coverage.

Typical depths of coverage are ~50-100x for genome assembly and 10-30x for “resequencing” (!)
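As a rough illustration of what these numbers imply, the average depth of coverage can be estimated as (number of reads × read length) / genome size. The sketch below applies this arithmetic. The function names are illustrative, and the result is only a genome-wide average: in practice, depth varies from region to region.

```python
import math


def mean_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average depth (rounded up)."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# Example: a 3 Gbp genome sequenced with 150 bp reads
genome_size = 3_000_000_000
n = reads_needed(target_depth=30, read_length_bp=150, genome_size_bp=genome_size)
print(f"Reads needed for 30x: {n:,}")
print(f"Depth with that many reads: {mean_depth(n, 150, genome_size):.1f}x")
```

For paired-end data, each read of a pair counts separately, so the number of fragments needed is half the number of reads.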


Which natural phenomenon might complicate this effort? Genetic variation among and (for diploid organisms) within individuals

The main HTS technologies

                 Short-read HTS                         Long-read HTS
Usage            More                                   Less (but increasing)
Main companies   Illumina                               Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio)
Timeline         Since 2005; technology fairly stable   Since 2011; still rapid development
Read lengths     50-300 bp                              10-100+ kbp
Error rates      Mostly <0.1%                           1-10% (ONT) / <0.1-10% (PacBio)
Throughput       Higher                                 Lower
Cost per base    Lower                                  Higher
AKA              Next-Generation Sequencing (NGS)       Third-generation sequencing
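To get a feel for how read length and error rate combine, the sketch below computes the expected number of errors in a read and the chance that a read has no errors at all. It makes the simplifying assumption that errors are independent and equally likely at every position, which real reads do not satisfy. The example values are picked from the ranges in the table, and the function names are illustrative.

```python
def expected_errors(read_length_bp: int, error_rate: float) -> float:
    """Expected number of erroneous bases in one read."""
    return read_length_bp * error_rate


def prob_error_free(read_length_bp: int, error_rate: float) -> float:
    """Probability that a read contains no errors (assumes independent errors)."""
    return (1 - error_rate) ** read_length_bp


# Example values chosen from the ranges in the table above
examples = {
    "short read (150 bp, 0.1% error)": (150, 0.001),
    "long read (20 kbp, 5% error)": (20_000, 0.05),
}
for label, (length, rate) in examples.items():
    print(
        f"{label}: {expected_errors(length, rate):.2f} expected errors, "
        f"P(error-free) = {prob_error_free(length, rate):.3g}"
    )
```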

Illumina HTS

Illumina (short-read HTS / NGS)

  • 100-300 bp reads with 0.1-0.2% error rates

  • More reads, lower per-base cost, and generally lower error rates than long-read sequencing.

  • Machines differ in throughput, read length, cost per Gb:

Libraries and library prep

In an HTS context, a “library” is a collection of DNA fragments ready for sequencing.


In Illumina libraries, these fragments number in the millions or billions and are often simply randomly generated from input such as genomic DNA:

A diagram showing the main Illumina library preparation steps.

An overview of the library prep procedure. This is typically done for you by a sequencing facility or company.

Libraries and library prep (cont.)

After library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:



Multiplexing!

Adapters can include so-called “indices” or “barcodes” that identify individual samples. That way, up to 96 samples can be combined (multiplexed) into a single library,
i.e. into a single tube.
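After sequencing, multiplexed reads have to be assigned back to their samples (“demultiplexing”). This is normally done for you by the sequencer software or the sequencing facility, but the idea can be sketched as a simple lookup. The index sequences and sample names below are made up for illustration, and only exact index matches are accepted, whereas real tools usually tolerate a mismatch or two.

```python
from collections import defaultdict

# Hypothetical index (barcode) sequences and sample names, for illustration only
SAMPLE_BY_INDEX = {
    "ACGTAC": "sample_A",
    "TGCATG": "sample_B",
}


def demultiplex(reads, sample_by_index):
    """Group (index, sequence) pairs by sample.

    Reads whose index is not in the lookup table are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, seq in reads:
        sample = sample_by_index.get(index, "undetermined")
        by_sample[sample].append(seq)
    return dict(by_sample)


# Each read is represented as (index sequence, read sequence)
reads = [
    ("ACGTAC", "AAAA"),
    ("TGCATG", "CCCC"),
    ("GGGGGG", "TTTT"),  # index not assigned to any sample
    ("ACGTAC", "GGGG"),
]
for sample, seqs in demultiplex(reads, SAMPLE_BY_INDEX).items():
    print(sample, seqs)
```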

Paired-end vs. single-end sequencing

DNA fragments can be sequenced from both ends as shown below —
this is called “paired-end” (PE) sequencing:

A diagram showing forward and reverse reads in paired-end sequencing.


When sequencing is instead single-end (SE), no reverse read is produced:

Insert size

  • The total size of the biological DNA fragment (without adapters) is often called the insert size:

Insert size variation

The insert size can vary – by design, but also because of limited precision in size selection. In some cases, it is:


Shorter than the combined read length, which leads to?

Overlapping reads (this can be useful!):

A diagram illustrating the scenario when the DNA fragment is shorter than the combined read length


Shorter than the single read length, which leads to?

“Adapter read-through”: the final bases in the resulting reads will consist of adapter sequence, which should be removed before downstream analysis

A diagram illustrating the scenario when the DNA fragment is shorter than the single read length
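The scenarios above come down to simple arithmetic on the insert size and the read length. The sketch below works this out for paired-end reads of equal length. The function and key names are illustrative, and the model ignores real-world details such as quality trimming.

```python
def describe_insert(insert_size: int, read_length: int) -> dict:
    """Summarize how a pair of reads covers an insert of the given size.

    Assumes paired-end sequencing with both reads of length `read_length`.
    """
    return {
        # Bases in the middle of the insert that neither read reaches
        "unsequenced_middle_bp": max(0, insert_size - 2 * read_length),
        # Bases covered by both reads (at most the whole insert)
        "overlap_bp": min(insert_size, max(0, 2 * read_length - insert_size)),
        # Adapter bases at the end of each read ("adapter read-through")
        "adapter_bp_per_read": max(0, read_length - insert_size),
    }


# Example: 150 bp paired-end reads and three different insert sizes
for insert_size in (500, 250, 100):
    print(insert_size, describe_insert(insert_size, read_length=150))
```

With 150 bp reads, a 500 bp insert leaves 200 bp in the middle unsequenced, a 250 bp insert gives reads that overlap by 50 bp, and a 100 bp insert gives fully overlapping reads that each end in 50 bp of adapter sequence.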

How Illumina sequencing works

First, library fragments bind to a surface thanks to the adapters, and the DNA templates are then PCR-amplified to form “clusters” of identical fragments:

In the diagram above, for illustrative purposes:

  • Only a few nucleotides are shown (1 block = 1 nucleotide) — in reality, fragments are much longer
  • Only two templates and clusters are shown — in reality, there are millions

How Illumina sequencing works (cont.)

Then, sequencing is performed by synthesizing a new strand using fluorescently-labeled bases and taking a picture each time a new nucleotide is incorporated:

How Illumina sequencing works (cont.)

Video of Illumina technology

How errors come about in Illumina

  • The different templates within a cluster get out of sync because occasionally:
    • They miss a base incorporation
    • They incorporate two bases at once

  • Base incorporation may also terminate before the end of the template is reached

This error profile is why, for Illumina:

  • There are hard limits on read lengths
  • Base quality scores typically decrease towards the end of the reads

How Illumina sequencing works: Zooming out

Long-read HTS

Long-read HTS

The two main long-read HTS technologies work very differently, but they have some features in common beyond long reads. They both:


  • Perform “single-molecule” sequencing (no PCR amplification of library fragments)
  • Require higher quality & quantity of DNA (because of the lack of PCR)
  • Can detect some base modifications, like methyl groups

Error rates are changing

I mentioned earlier that long-read HTS has a higher error rate than short-read (Illumina) HTS.

However, error rates in one type of PacBio sequencing where individual fragments are sequenced multiple times (“HiFi”) are now lower than in Illumina.
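Why does sequencing the same fragment several times reduce errors? If errors in different passes are mostly independent, they rarely hit the same position twice, so a consensus across passes can outvote them. The sketch below is only a toy majority vote, not how PacBio's consensus calling actually works: it assumes the passes are already aligned, have equal length, and contain substitution errors only.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Build a consensus by taking the most common base at each position.

    Toy model: all passes must be pre-aligned and of equal length
    (no insertions or deletions).
    """
    if len({len(p) for p in passes}) != 1:
        raise ValueError("need at least one pass, and all passes must have equal length")
    consensus = []
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Three passes over the same fragment, two of them with one error each
passes = [
    "ACGTAC",
    "ACGTTC",  # error at position 5
    "ACCTAC",  # error at position 3
]
print(majority_consensus(passes))
```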

Nanopore sequencing

A single strand of DNA passes through a nanopore while the electrical current is measured. This current depends on the combination of bases passing through the pore at that moment:

Video of Oxford Nanopore technology

ONT (Nanopore) sequencers

Under development!

ONT constantly releases new flow cells with updated technology, which have led to large decreases in error rates over the past decade — and even over the past two or so years.

(Reference) Genomes

Reference genomes

Many HTS applications either require a “reference genome” or involve its production. What exactly does reference genome refer to? It usually includes:

  • An assembly
    A representation of most or all of the genome DNA sequence: the genome assembly

  • An annotation
    Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features


Taxonomic identity

Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:

  • If needed, it’s often possible to work with genomes of closely related species
  • Conversely, different subspecies/lines may have their own reference genomes

Genome size variation

https://en.wikipedia.org/wiki/Genome_size

Genome structure

https://en.wikipedia.org/wiki/Karyotype





Key features:

  • Number of distinct chromosomes
  • Ploidy

Growth of genome databases


Konkel and Slot (2023)

Genome assemblies

  • With increasing usage & quality of long-read HTS, assemblies are getting better and better

  • For chromosome-level assemblies, i.e. with one contiguous sequence for each chromosome, technologies beyond sequencing are often needed (e.g. Hi-C, optical mapping)

  • Many assemblies are not “chromosome-level”, but consist of fragments (contigs and scaffolds), often thousands of them. Even chromosome-level assemblies are not 100% complete.


Question: Contigs vs. scaffolds?

Contigs are contiguous, known stretches of DNA created by the assembly process, basically by overlapping reads.

Often, the order and orientation of two or more contigs are known, but there is a gap of unknown size between them. Such contigs are connected into scaffolds with a stretch of Ns in between.
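Because gaps are written as runs of Ns, the contigs that make up a scaffold can be recovered by splitting the scaffold sequence at those runs. The sketch below shows this on a made-up sequence. The function names are illustrative, and since the true gap size is unknown, the number of Ns in a file is often just a placeholder.

```python
import re


def split_scaffold(scaffold: str, min_gap: int = 1) -> list[str]:
    """Split a scaffold into contigs at runs of at least `min_gap` Ns."""
    gap_pattern = f"[Nn]{{{min_gap},}}"
    return [contig for contig in re.split(gap_pattern, scaffold) if contig]


def gap_lengths(scaffold: str, min_gap: int = 1) -> list[int]:
    """Lengths of the N-runs (gaps) in a scaffold, in order of appearance."""
    gap_pattern = f"[Nn]{{{min_gap},}}"
    return [m.end() - m.start() for m in re.finditer(gap_pattern, scaffold)]


# Made-up scaffold: three contigs joined by two gaps
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 5 + "TTGACA"
print(split_scaffold(scaffold))
print(gap_lengths(scaffold))
```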

How is this data stored?

Both genome assemblies and annotations are typically saved in a single text file each — we’ll explore some of these files in tomorrow’s lab.

Take-home messages

TBA

Up next

The Garrigós et al. 2025 dataset

The labs this and next week are organized around the data set from Garrigós et al. (2025):

A screenshot of the paper's front matter.

This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.

Tomorrow’s lab

Next week’s content

Garrigós, Marta, Guillem Ylla, Josué Martínez-de la Puente, Jordi Figuerola, and María José Ruiz-López. 2025. “Two Avian Plasmodium Species Trigger Different Transcriptional Responses on Their Vector Culex pipiens.” Molecular Ecology 34 (15): e17240. https://doi.org/10.1111/mec.17240.
Konkel, Zachary, and Jason C. Slot. 2023. “Mycotools: An Automated and Scalable Platform for Comparative Genomics.” bioRxiv. https://doi.org/10.1101/2023.09.08.556886.
Pereira, Rute, Jorge Oliveira, and Mário Sousa. 2020. “Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics.” Journal of Clinical Medicine 9 (1). https://doi.org/10.3390/jcm9010132.